Although the previous post's implementation did end with a small result, it was, frankly, a last-minute scramble.
And the main reason for that is, of course...
Just getting the earlier pieces (the API) working and figuring out TFRecord already ate up a ton of time QQ
Excuses! All of it, excuses!
Fine! Excuses they are, which conveniently lets me paste the whole process here and call it today's post (confetti~
That said, diligent as I am, I did tweak some of the code and carefully re-implement the whole thing!!!
So, a quick rundown of what's below!
Purely as a test, I used two LSTM+Dropout blocks, a third LSTM to collapse the sequence, and then three fully-connected (Dense) layers:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 90, 50)            10400     
_________________________________________________________________
dropout (Dropout)            (None, 90, 50)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 90, 50)            20200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 90, 50)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 30)                9720      
_________________________________________________________________
dense (Dense)                (None, 40)                1240      
_________________________________________________________________
dense_1 (Dense)              (None, 20)                820       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210       
=================================================================
Total params: 42,590
Trainable params: 42,590
Non-trainable params: 0
_________________________________________________________________
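As a sanity check on those numbers, the parameter counts in the table can be reproduced by hand (a small sketch of my own; an LSTM layer has 4 gates, each with an input kernel, a recurrent kernel, and a bias, while a Dense layer is just weights plus bias):
lstm_params = lambda input_dim, units: 4 * ((input_dim + units) * units + units)
dense_params = lambda input_dim, units: (input_dim + 1) * units
print(lstm_params(1, 50))    # 10400  (lstm)
print(lstm_params(50, 50))   # 20200  (lstm_1)
print(lstm_params(50, 30))   # 9720   (lstm_2)
print(dense_params(30, 40))  # 1240   (dense)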
Continuing with the daily-historical-stock-prices-1970-2018 dataset from the previous post.
Before moving to TFRecord, I first converted the .csv files to .npy, saving over 40% of the space (2 GB -> 1.13 GB):
import numpy as np
import pandas as pd

# dataset_path / dataset_dict are set up in the previous post.
names = pd.read_csv(dataset_path + dataset_dict['names'], engine='python')
companies = names.ticker.unique()            # list of unique tickers
np.save('companies.npy', companies)

prices = pd.read_csv(dataset_path + dataset_dict['prices'], engine='python')
prices = prices.values                       # keep only the raw ndarray
np.save('dataGen_full.npy', prices)
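Just to double-check that saving, the on-disk sizes can be compared with os.path.getsize (a quick sketch; dataset_path and dataset_dict are the variables from the previous post):
import os
csv_size = os.path.getsize(dataset_path + dataset_dict['prices'])  # original CSV
npy_size = os.path.getsize('dataGen_full.npy')                     # converted .npy
print('saved {:.1f}% of the space'.format((1 - npy_size / csv_size) * 100))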
import os
import tensorflow as tf
import numpy as np

companies = np.load('companies.npy', allow_pickle=True)
prices = np.load('dataGen_full.npy', allow_pickle=True)

# Helpers that wrap raw values into tf.train.Feature protos.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _float32_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def create_tfrecords(writer, days, stock_sc):
    stock_sc = np.asarray(stock_sc, np.float32)
    # Leave 10 days at the end so every window still has a full label.
    max_days = len(stock_sc) - 10

    for i in range(days, max_days):
        x = stock_sc[i - days:i, 0]   # the past `days` prices as input
        y = stock_sc[i:i + 10, 0]     # the next 10 prices as the target
        # Serialize each float32 window to raw bytes.
        feature = {
            'x': _bytes_feature([x.tostring()]),
            'y': _bytes_feature([y.tostring()])}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
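To make the windowing concrete, here is the same slicing logic run on a toy series (my own sanity check; the numbers are illustrative):
import numpy as np
demo = np.arange(105, dtype=np.float32).reshape(-1, 1)  # 105 fake prices
days = 90
max_days = len(demo) - 10      # 95, so the last window still has 10 labels left
for i in range(days, max_days):
    x = demo[i - days:i, 0]    # the 90 past values
    y = demo[i:i + 10, 0]      # the next 10 values
print(max_days - days, x.shape, y.shape)  # 5 windows, (90,), (10,)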
# Data Normalization
from sklearn.preprocessing import MinMaxScaler

num_choose = 20
count = len(companies)
# Sample without replacement so the same company isn't picked twice.
rand_company_indexes = np.random.choice(count, size=num_choose, replace=False)
companies_choose = companies[rand_company_indexes]
# Columns after the ticker: open  close  adj_close  low  high  volume
scaler = MinMaxScaler()
days_before = 90
save_name = './testing.tfrecord'  # training.tfrecord (used below) was built the same way
with tf.python_io.TFRecordWriter(save_name) as writer:
    for company in companies_choose:
        indexes = prices[:, 0] == company
        stock = prices[indexes, 1:]
        training_data = stock[:, 2:3]   # keep only the adj_close column
        # Skip companies with too little history for even one window.
        if len(training_data) < 100:
            continue
        # Scale each company's prices to [0, 1] independently.
        get_sc = scaler.fit_transform(training_data)
        create_tfrecords(writer=writer, days=days_before, stock_sc=get_sc)
import tensorflow as tf

file = "training.tfrecord"
record_iterator = tf.python_io.tf_record_iterator(path=file)
count = 0
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    if count == 0:
        # Print only the first record; the full dump would be huge.
        print(example)
    count += 1
print('total count:', count)
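As one more check on the round trip, the raw bytes of a parsed record can be decoded straight back into numpy (a small sketch; the feature name 'x' and the dtype match the writer above):
import numpy as np
x_bytes = example.features.feature['x'].bytes_list.value[0]
x = np.frombuffer(x_bytes, dtype=np.float32)  # same dtype used when writing
print(x.shape)  # expected: (90,)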
 
For the initial settings when feeding training data through the tf.data API, I set the batch size to 3000, simply copying what I saw others use.
The previous post used only 200, which felt nowhere near enough to learn anything... (pure gut feeling
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os

reshape_size = 90        # days of history per input window
length = 352983          # total number of records in training.tfrecord
batch_size = 3000

def extract_features(example, reshape_size):
    features = tf.parse_single_example(
        example,
        features={
            'x': tf.FixedLenFeature([], tf.string),
            'y': tf.FixedLenFeature([], tf.string),
        }
    )
    # Decode the raw bytes back into float32 tensors.
    stock = tf.decode_raw(features['x'], tf.float32)
    stock = tf.reshape(stock, [reshape_size])
    stock = tf.expand_dims(stock, -1)   # LSTM expects (timesteps, features)
    label = tf.decode_raw(features['y'], tf.float32)
    label = tf.reshape(label, [10])
    return stock, label

tfrecords_path = './training.tfrecord'
dataset = tf.data.TFRecordDataset(tfrecords_path)
dataset = dataset.map(lambda x: extract_features(x, reshape_size))
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.repeat()
train_gen = dataset.make_initializable_iterator()
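Before wiring this into training, it's worth pulling a single batch to confirm the shapes match what the model expects (a quick sketch of my own, continuing from the snippet above):
next_x, next_y = train_gen.get_next()
with tf.Session() as sess:
    sess.run(train_gen.initializer)
    bx, by = sess.run([next_x, next_y])
    print(bx.shape, by.shape)  # expected: (3000, 90, 1) (3000, 10)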
You can inspect your model with summary() (that's where the table at the top came from)!!
# LSTM Training
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()
# Two LSTM+Dropout blocks that keep the full sequence...
model.add(LSTM(units=50, return_sequences=True, input_shape=(reshape_size, 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
# ...a final LSTM that collapses it to a single vector...
model.add(LSTM(units=30, return_sequences=False))
# ...and three fully-connected layers down to the 10-day prediction.
model.add(Dense(units=40))
model.add(Dense(units=20))
model.add(Dense(units=10))
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
Here I specifically use a callback to save the weights as training goes.
The number 352983 (stored in length above) is simply the total number of records in the dataset.
from tensorflow import keras

epochs = 20
checkpoint_path = './first_check/'
with tf.Session() as sess:
    sess.run(train_gen.initializer)
    checkpoint_file = os.path.join(checkpoint_path, "cp-{epoch:04d}.ckpt")
    # Save the weights after every epoch.
    cp_callback = keras.callbacks.ModelCheckpoint(checkpoint_file,
                                                  save_weights_only=True,
                                                  verbose=1, period=1)
    # The iterator already yields batches, so no batch_size is passed here.
    train = model.fit(train_gen, epochs=epochs,
                      steps_per_epoch=length // batch_size,
                      callbacks=[cp_callback])
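Since I never got to the testing stage, here is roughly how those checkpoints could be loaded back later (a sketch, assuming the same model has been rebuilt and compiled as above):
latest = tf.train.latest_checkpoint('./first_check/')  # newest cp-XXXX.ckpt
model.load_weights(latest)  # restore the weights saved by the callback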
 
I still didn't make it to the testing stage today, because something came up... ran out of time QQ
The main goal was to try out TFRecord, and it really is super hard~
Why do I insist on using TFRecord?
Honestly, I didn't really want to: it's fiddly, and generating the data is slow...
Its one advantage is that reading a TFRecord doesn't require loading the entire file into memory.
Huh? What does that mean?
It means the whole 2 GB dataset never has to sit in RAM at once; records are streamed off disk as they're needed. So despite all those drawbacks, this one advantage makes them easy to overlook!!
And since I'll probably need TFRecord anyway later on, I figured I might as well step on the landmines now...
Hopefully after about 5 days of this I can get back to my stock prediction QQ
I still want to use the APIs on GCP!!!
Reference: Using TFRecords and tf.Example